bugfix: missing fields in doc when using --log_samples #731
Bug: Fields with "image" in keys or of type dict are not saved in the sample log file. Fix: save dicts and all fields to the JSONL file
I think we added this hardcoded logic because we don't want to save large content (image, audio) into the JSONL files, which keeps processing fast.
The reason to keep fields with "image" in the key is that people may have "image_size", "image_description", or "image": [dict] in the doc, and these may be used for metric calculation. Note that dict-valued fields are not saved in the current latest version.
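To make the behavior under discussion concrete, here is a minimal sketch of the kind of filter being described. It is not the actual lmms-eval code: the function name `filter_doc_by_key` and the sample doc are hypothetical, and the sketch only assumes what this thread states, namely that keys containing "image" and dict-valued fields are dropped.

```python
def filter_doc_by_key(doc):
    """Hypothetical stand-in for the filtering described in this thread.

    Drops every field whose key contains "image" and every dict-valued
    field, regardless of how small the value is.
    """
    return {
        key: value
        for key, value in doc.items()
        if "image" not in key and not isinstance(value, dict)
    }


# Illustrative doc: only "image" is actually large.
doc = {
    "question": "What is shown?",
    "image": b"<raw bytes>",
    "image_size": [640, 480],
    "image_description": "a cat",
    "meta": {"source": "web"},
}

filtered = filter_doc_by_key(doc)
print(sorted(filtered))
```

In this sketch the small fields "image_size" and "image_description" are dropped together with the raw image, which is the side effect the comment above objects to.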
I think that in the later code (lmms-eval/lmms_eval/evaluator.py, lines 540 to 556 in e24a7d8) the saved doc is added to the example, and it is only written out when log samples is used? May I ask where this would be used for metric calculation? I think the doc is saved after the process-results step.
Yes, the structure of the log file is like this:

    example = {
        "doc_id": doc_id,
        "doc": saved_doc,
        "target": target,
        "arguments": filtered_arguments,
        "resps": [req.resps for req in requests],
        "filtered_resps": [req.filtered_resps[filter_key] for req in requests],
        "doc_hash": hash_string(
            json.dumps(
                requests[0].doc,
                indent=2,
                default=handle_non_serializable,
                ensure_ascii=False,
            )
        ),
        "prompt_hash": hash_string(requests[0].arguments[0]),
        "target_hash": hash_string(str(target)),
    }

The problem is that we are filtering the raw doc dict, so fields like "image_path" and "image_metainfo" are not saved in the log file. For example, I have a custom HF dataset with a complicated data structure, and it comes with a data loading script (in the HF dataset repo) that downloads the image data and returns the "image_path" when loaded.

Why is it desirable to keep the integrity of the raw doc info?
The concern here is that in many of the datasets currently integrated by lmms-eval, even the text part of the saved doc contains a lot of content. When we perform evaluation, saving the doc makes writing the results extremely slow and requires 200+ MB to store the file. I would suggest making saving the full doc an optional choice instead of hardcoding it here.
I agree. It would be nice to have an argument controlling this behavior, which would also make the process more transparent. In my case, I only discovered this issue after a few runs. Thank you very much!
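One way the suggested option could look is sketched below. This is only an illustration under stated assumptions: the `save_full_doc` parameter and the helper names `build_saved_doc` and `to_jsonl_line` are hypothetical and are not an existing lmms-eval flag or API. The default keeps the filtering described in this thread; opting in keeps every field.

```python
import json


def build_saved_doc(doc, save_full_doc=False):
    """Return the doc to log (hypothetical helper, not lmms-eval API).

    save_full_doc=False: keep the filtering described in this thread
    (drop keys containing "image" and dict-valued fields).
    save_full_doc=True: keep every field of the raw doc.
    """
    if save_full_doc:
        return dict(doc)
    return {
        key: value
        for key, value in doc.items()
        if "image" not in key and not isinstance(value, dict)
    }


def to_jsonl_line(saved_doc):
    # default=str is an assumption of this sketch: it turns values that
    # json cannot serialize into strings so the line can always be written.
    return json.dumps(saved_doc, default=str, ensure_ascii=False)


doc = {
    "question": "What is shown?",
    "image_path": "/tmp/a.png",
    "image_metainfo": {"width": 640, "height": 480},
}

default_line = to_jsonl_line(build_saved_doc(doc))
full_line = to_jsonl_line(build_saved_doc(doc, save_full_doc=True))
print(default_line)
print(full_line)
```

Keeping the filtered behavior as the default would preserve the current file sizes for existing datasets, while users who need fields such as "image_path" could opt in explicitly.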
Issue: #712 (comment)
Bug ❓: Fields with "image" in keys or of type dict are not saved in the sample log file.
Fix: save dicts and all fields to the JSONL file
Testing: The updated code has been tested with image-text-to-text inference. All fields in the doc, including dicts, are saved in the log file.